Since the outbreak of the COVID-19 pandemic at the beginning of 2020, the scientific circles have concentrated on studying the structure and characteristics of SARS-CoV-2 virus in order to invent an effective vaccine in the shortest possible time. However, before accomplishing that task it’s crucial to find accurate indicators, which would be helpful in early diagnosis. The aim of the following analysis was to study medical data collected in the early state of pandemic in search of interesting patterns. The first section showed preparation of the environment, which assures the repeatability of the analysis with each run. Next section introduced the original structure of source data, presented basic statistics and walked through the following steps of data processing until obtaining a clean dataset. For better understanding of each variable, the codebook was prepared and can be found in the Data description section. The main part of the analysis begins in the Dataset analysis section. Trends over time indicated that the number of infected patients started growing rapidly around January 27, 2020 and the sooner an infected patient got under medical care, the higher chances of survival he or she had. The majority of people who got to the hospital already in severe condition had died. The elderly and people with preexisting severe health conditions reported to have a higher fatality rate of COVID-19. The blood test results suggested that among available types of blood exams the most reliable biomarkers to detect coronavirus-19 infection were hs-cTnI, NT-proBNP and hs-CRP. Attribute correlation study confirmed that levels of substances tested in standard Blood Biochemistry, Complete Blood Count and Coagulation tests are correlated in each of these categories. The classifier detected that for the presented dataset the crucial tests in outcome of treatment prediction were: LDH level, amounts of neutrophils and lymphocytes, high sensitivity C-reactive protein and albumin protein levels and most of available coagulation tests.
Setting global options and initialization of the seed variable, which is necessary for ensuring reproducibility.
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
set.seed(23)
Loading the libraries used in the project.
library(knitr)
library(xlsx)
library(dplyr)
library(tidyr)
library(lubridate)
library(arules)
library(caret)
library(DT)
library(kableExtra)
library(plotly)
library(ggplot2)
library(corrplot)
library(codebook)
The dataset is based on the medical records of 375 patients suffering from COVID-19 collected between 10 January and 18 February 2020 in Tongji Hospital in China. The medical records were gathered by using standard case report forms, which included clinical, epidemiological, demographic, laboratory and mortality outcome information. Data collecting methods and some basic statistics are presented in Tan et al article published in Nature, 14 May 2020.
This section covers loading source data, initial remarks about its content, cleaning phase and detailed dataset description.
The dataset can be downloaded under this link. In this project it’s stored in a local subdirectory to ensure availability, even when it’s deleted from the original source.
Loading source data into a dataframe without any pre clean-up with caching option checked.
path <- '../res/wuhan_blood_sample_data_Jan_Feb_2020.xlsx'
df_raw <- read.xlsx(path, sheetIndex=1, header=TRUE)
| rows | columns | unique RE_DATE | NAs in RE_DATE | NAs in PATIENT_ID |
|---|---|---|---|---|
| 6120 | 81 | 3092 | 14 | 5745 |
In order to get a distinctive key for each line we will combine RE_DATE with PATIENT_ID values. In the first step we will remove rows with NA in RE_DATE, because after further checking we can see that all of the blood test results columns are empty for these lines. Then we will fill the PATIENT_ID column.
all(sapply(df_raw[is.na(df_raw$RE_DATE), c(8:81)], is.na))
## [1] TRUE
df <- df_raw %>%
filter(!is.na(RE_DATE)) %>%
fill(PATIENT_ID)
nrow(df)
## [1] 6106
length(unique(paste(df$PATIENT_ID, df$RE_DATE)))
## [1] 6106
The columns can be divided into 2 categories:
Short notes about the first 7 columns, containing information about a patient, are shown below.
## 'data.frame': 6106 obs. of 7 variables:
## $ PATIENT_ID : num 1 1 1 1 1 1 1 1 1 1 ...
## $ RE_DATE : POSIXct, format: "2020-01-31 01:09:00" "2020-01-31 01:25:00" ...
## $ age : num 73 73 73 73 73 73 73 73 73 73 ...
## $ gender : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Admission.time: POSIXct, format: "2020-01-30 22:12:47" "2020-01-30 22:12:47" ...
## $ Discharge.time: POSIXct, format: "2020-02-17 12:40:08" "2020-02-17 12:40:08" ...
## $ outcome : num 0 0 0 0 0 0 0 0 0 0 ...
As we can see, the gender and outcome of treatment columns are encoded with numbers. Both of these attributes have only 2 unique values, so we will transform them into factors for more clarity. The information about encoding was found in the article mentioned before, which was used as a reference in this analysis.
df <- df %>%
mutate(gender = factor(gender, labels = c('male', 'female')),
outcome = factor(outcome, labels = c('survived', 'deceased')))
The next table presents a summary for the remaining columns, which store results of specific blood tests.
| Hypersensitive.cardiac.troponinI | hemoglobin | Serum.chloride | Prothrombin.time | procalcitonin | eosinophils… | Interleukin.2.receptor | Alkaline.phosphatase | albumin | basophil… | Interleukin.10 | Total.bilirubin | Platelet.count | monocytes… | antithrombin | Interleukin.8 | indirect.bilirubin | Red.blood.cell.distribution.width. | neutrophils… | total.protein | Quantification.of.Treponema.pallidum.antibodies | Prothrombin.activity | HBsAg | mean.corpuscular.volume | hematocrit | White.blood.cell.count | Tumor.necrosis.factorα | mean.corpuscular.hemoglobin.concentration | fibrinogen | Interleukin.1β | Urea | lymphocyte.count | PH.value | Red.blood.cell.count | Eosinophil.count | Corrected.calcium | Serum.potassium | glucose | neutrophils.count | Direct.bilirubin | Mean.platelet.volume | ferritin | RBC.distribution.width.SD | Thrombin.time | X…lymphocyte | HCV.antibody.quantification | D.D.dimer | Total.cholesterol | aspartate.aminotransferase | Uric.acid | HCO3. | calcium | Amino.terminal.brain.natriuretic.peptide.precursor.NT.proBNP. | Lactate.dehydrogenase | platelet.large.cell.ratio. | Interleukin.6 | Fibrin.degradation.products | monocytes.count | PLT.distribution.width | globulin | γ.glutamyl.transpeptidase | International.standard.ratio | basophil.count… | X2019.nCoV.nucleic.acid.detection | mean.corpuscular.hemoglobin. | Activation.of.partial.thromboplastin.time | High.sensitivity.C.reactive.protein | HIV.antibody.quantification | serum.sodium | thrombocytocrit | ESR | glutamic.pyruvic.transaminase | eGFR | creatinine | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.9 | Min. : 6.4 | Min. : 71.50 | Min. : 11.50 | Min. : 0.020 | Min. :0.000 | Min. : 61.0 | Min. : 17.00 | Min. :13.60 | Min. :0.00 | Min. : 5.00 | Min. : 2.50 | Min. : -1.0 | Min. : 0.300 | Min. : 20.00 | Min. : 5.000 | Min. : 0.100 | Min. :10.60 | Min. : 1.7 | Min. :31.80 | Min. : 0.020 | Min. : 6.00 | Min. : 0.000 | Min. : 61.60 | Min. :14.50 | Min. : 0.13 | Min. : 4.00 | Min. :286.0 | Min. : 0.500 | Min. : 5.00 | Min. : 0.800 | Min. : 0.000 | Min. :5.000 | Min. : 0.100 | Min. :0.000 | Min. :1.650 | Min. : 2.760 | Min. : 1.000 | Min. : 0.06 | Min. : 1.600 | Min. : 8.50 | Min. : 17.8 | Min. : 31.30 | Min. : 13.00 | Min. : 0.000 | Min. :0.020 | Min. : 0.210 | Min. :0.100 | Min. : 6.00 | Min. : 43.0 | Min. : 6.30 | Min. :1.170 | Min. : 5 | Min. : 110.0 | Min. :11.20 | Min. : 1.500 | Min. : 4.00 | Min. : 0.010 | Min. : 8.00 | Min. :10.10 | Min. : 3.00 | Min. : 0.840 | Min. :0.000 | Min. :-1 | Min. :20.4 | Min. : 21.80 | Min. : 0.10 | Min. :0.05 | Min. :115.4 | Min. :0.010 | Min. : 1.00 | Min. : 5.00 | Min. : 2.00 | Min. : 11.00 | |
| 1st Qu.: 4.4 | 1st Qu.:113.0 | 1st Qu.: 99.05 | 1st Qu.: 13.60 | 1st Qu.: 0.040 | 1st Qu.:0.000 | 1st Qu.: 459.5 | 1st Qu.: 54.00 | 1st Qu.:27.40 | 1st Qu.:0.10 | 1st Qu.: 5.00 | 1st Qu.: 7.40 | 1st Qu.:109.0 | 1st Qu.: 2.800 | 1st Qu.: 74.00 | 1st Qu.: 8.675 | 1st Qu.: 3.800 | 1st Qu.:12.00 | 1st Qu.:65.1 | 1st Qu.:61.00 | 1st Qu.: 0.040 | 1st Qu.: 65.00 | 1st Qu.: 0.000 | 1st Qu.: 86.90 | 1st Qu.:33.50 | 1st Qu.: 4.94 | 1st Qu.: 6.70 | 1st Qu.:333.0 | 1st Qu.: 3.050 | 1st Qu.: 5.00 | 1st Qu.: 4.000 | 1st Qu.: 0.460 | 1st Qu.:6.000 | 1st Qu.: 3.680 | 1st Qu.:0.000 | 1st Qu.:2.270 | 1st Qu.: 3.950 | 1st Qu.: 5.550 | 1st Qu.: 3.09 | 1st Qu.: 3.225 | 1st Qu.:10.10 | 1st Qu.: 377.2 | 1st Qu.: 38.50 | 1st Qu.: 15.60 | 1st Qu.: 3.925 | 1st Qu.:0.040 | 1st Qu.: 0.602 | 1st Qu.:3.010 | 1st Qu.: 19.50 | 1st Qu.: 183.2 | 1st Qu.:21.00 | 1st Qu.:1.980 | 1st Qu.: 150 | 1st Qu.: 218.0 | 1st Qu.:25.60 | 1st Qu.: 4.772 | 1st Qu.: 4.00 | 1st Qu.: 0.270 | 1st Qu.:11.10 | 1st Qu.:29.70 | 1st Qu.: 22.00 | 1st Qu.: 1.030 | 1st Qu.:0.010 | 1st Qu.:-1 | 1st Qu.:29.7 | 1st Qu.: 35.30 | 1st Qu.: 5.70 | 1st Qu.:0.07 | 1st Qu.:137.7 | 1st Qu.:0.150 | 1st Qu.: 14.00 | 1st Qu.: 16.00 | 1st Qu.: 63.58 | 1st Qu.: 58.00 | |
| Median : 20.6 | Median :125.0 | Median :102.10 | Median : 14.80 | Median : 0.100 | Median :0.100 | Median : 676.5 | Median : 69.50 | Median :32.20 | Median :0.20 | Median : 5.90 | Median : 10.70 | Median :178.0 | Median : 5.700 | Median : 86.00 | Median : 16.000 | Median : 5.400 | Median :12.60 | Median :82.4 | Median :65.90 | Median : 0.050 | Median : 81.00 | Median : 0.010 | Median : 90.10 | Median :36.60 | Median : 7.72 | Median : 8.60 | Median :343.0 | Median : 4.120 | Median : 5.00 | Median : 5.985 | Median : 0.800 | Median :6.500 | Median : 4.140 | Median :0.010 | Median :2.360 | Median : 4.410 | Median : 6.990 | Median : 5.85 | Median : 4.800 | Median :10.80 | Median : 711.0 | Median : 40.90 | Median : 16.80 | Median :11.450 | Median :0.060 | Median : 2.155 | Median :3.630 | Median : 27.00 | Median : 243.7 | Median :23.50 | Median :2.080 | Median : 585 | Median : 340.0 | Median :30.90 | Median : 19.265 | Median : 17.90 | Median : 0.410 | Median :12.40 | Median :32.70 | Median : 34.00 | Median : 1.140 | Median :0.010 | Median :-1 | Median :30.9 | Median : 39.20 | Median : 51.50 | Median :0.09 | Median :140.4 | Median :0.210 | Median : 28.00 | Median : 24.00 | Median : 87.90 | Median : 76.00 | |
| Mean : 1223.2 | Mean :123.1 | Mean :103.14 | Mean : 16.68 | Mean : 1.107 | Mean :0.629 | Mean : 907.2 | Mean : 82.47 | Mean :32.01 | Mean :0.21 | Mean : 16.07 | Mean : 16.70 | Mean :184.3 | Mean : 6.155 | Mean : 85.32 | Mean : 83.088 | Mean : 6.889 | Mean :13.07 | Mean :77.6 | Mean :65.30 | Mean : 0.132 | Mean : 78.55 | Mean : 8.306 | Mean : 90.39 | Mean :36.55 | Mean : 15.60 | Mean : 11.58 | Mean :342.8 | Mean : 4.294 | Mean : 6.51 | Mean : 9.589 | Mean : 1.017 | Mean :6.484 | Mean : 9.288 | Mean :0.039 | Mean :2.355 | Mean : 4.509 | Mean : 8.889 | Mean : 7.81 | Mean : 9.887 | Mean :10.91 | Mean : 1379.1 | Mean : 42.44 | Mean : 18.17 | Mean :15.392 | Mean :0.117 | Mean : 7.943 | Mean :3.689 | Mean : 46.53 | Mean : 276.1 | Mean :23.14 | Mean :2.078 | Mean : 3669 | Mean : 474.2 | Mean :31.77 | Mean : 112.308 | Mean : 61.35 | Mean : 0.526 | Mean :13.01 | Mean :33.24 | Mean : 55.34 | Mean : 1.313 | Mean :0.017 | Mean :-1 | Mean :31.0 | Mean : 41.52 | Mean : 76.24 | Mean :0.10 | Mean :141.6 | Mean :0.212 | Mean : 33.69 | Mean : 38.86 | Mean : 81.56 | Mean : 109.93 | |
| 3rd Qu.: 223.8 | 3rd Qu.:137.0 | 3rd Qu.:105.65 | 3rd Qu.: 16.70 | 3rd Qu.: 0.405 | 3rd Qu.:0.800 | 3rd Qu.:1155.5 | 3rd Qu.: 95.00 | 3rd Qu.:36.60 | 3rd Qu.:0.30 | 3rd Qu.: 12.35 | 3rd Qu.: 16.77 | 3rd Qu.:248.0 | 3rd Qu.: 8.600 | 3rd Qu.: 97.00 | 3rd Qu.: 35.200 | 3rd Qu.: 8.000 | 3rd Qu.:13.70 | 3rd Qu.:92.3 | 3rd Qu.:70.45 | 3rd Qu.: 0.070 | 3rd Qu.: 95.00 | 3rd Qu.: 0.010 | 3rd Qu.: 93.90 | 3rd Qu.:39.90 | 3rd Qu.: 12.72 | 3rd Qu.: 11.50 | 3rd Qu.:350.0 | 3rd Qu.: 5.480 | 3rd Qu.: 5.00 | 3rd Qu.:11.400 | 3rd Qu.: 1.310 | 3rd Qu.:7.294 | 3rd Qu.: 4.650 | 3rd Qu.:0.060 | 3rd Qu.:2.440 | 3rd Qu.: 4.870 | 3rd Qu.:10.260 | 3rd Qu.:10.95 | 3rd Qu.: 8.275 | 3rd Qu.:11.50 | 3rd Qu.: 1425.2 | 3rd Qu.: 44.70 | 3rd Qu.: 18.38 | 3rd Qu.:24.975 | 3rd Qu.:0.090 | 3rd Qu.:21.000 | 3rd Qu.:4.265 | 3rd Qu.: 42.00 | 3rd Qu.: 333.8 | 3rd Qu.:25.90 | 3rd Qu.:2.190 | 3rd Qu.: 2625 | 3rd Qu.: 601.8 | 3rd Qu.:37.20 | 3rd Qu.: 60.167 | 3rd Qu.:150.00 | 3rd Qu.: 0.580 | 3rd Qu.:14.30 | 3rd Qu.:36.50 | 3rd Qu.: 58.00 | 3rd Qu.: 1.330 | 3rd Qu.:0.020 | 3rd Qu.:-1 | 3rd Qu.:32.2 | 3rd Qu.: 44.12 | 3rd Qu.:118.50 | 3rd Qu.:0.11 | 3rd Qu.:143.5 | 3rd Qu.:0.270 | 3rd Qu.: 45.50 | 3rd Qu.: 41.00 | 3rd Qu.:103.97 | 3rd Qu.: 98.25 | |
| Max. :50000.0 | Max. :178.0 | Max. :140.40 | Max. :120.00 | Max. :57.170 | Max. :8.600 | Max. :7500.0 | Max. :620.00 | Max. :48.60 | Max. :1.70 | Max. :1000.00 | Max. :505.70 | Max. :558.0 | Max. :53.000 | Max. :136.00 | Max. :6795.000 | Max. :145.100 | Max. :27.10 | Max. :98.9 | Max. :88.70 | Max. :11.950 | Max. :142.00 | Max. :250.000 | Max. :118.90 | Max. :52.30 | Max. :1726.60 | Max. :168.00 | Max. :514.0 | Max. :10.780 | Max. :88.50 | Max. :68.400 | Max. :52.420 | Max. :7.565 | Max. :749.500 | Max. :0.490 | Max. :2.790 | Max. :12.800 | Max. :43.010 | Max. :33.88 | Max. :360.600 | Max. :15.00 | Max. :50000.0 | Max. :113.30 | Max. :161.90 | Max. :60.000 | Max. :2.090 | Max. :60.000 | Max. :7.300 | Max. :1858.00 | Max. :1176.0 | Max. :36.30 | Max. :2.620 | Max. :70000 | Max. :1867.0 | Max. :62.20 | Max. :5000.000 | Max. :190.80 | Max. :39.920 | Max. :25.30 | Max. :50.60 | Max. :732.00 | Max. :13.480 | Max. :0.120 | Max. :-1 | Max. :50.8 | Max. :144.00 | Max. :320.00 | Max. :0.27 | Max. :179.7 | Max. :0.510 | Max. :110.00 | Max. :1600.00 | Max. :224.00 | Max. :1497.00 | |
| NA’s :5599 | NA’s :5131 | NA’s :5131 | NA’s :5444 | NA’s :5647 | NA’s :5149 | NA’s :5838 | NA’s :5176 | NA’s :5172 | NA’s :5149 | NA’s :5839 | NA’s :5176 | NA’s :5149 | NA’s :5148 | NA’s :5776 | NA’s :5838 | NA’s :5200 | NA’s :5183 | NA’s :5149 | NA’s :5175 | NA’s :5827 | NA’s :5447 | NA’s :5827 | NA’s :5149 | NA’s :5149 | NA’s :4979 | NA’s :5838 | NA’s :5149 | NA’s :5540 | NA’s :5838 | NA’s :5170 | NA’s :5149 | NA’s :5722 | NA’s :4979 | NA’s :5149 | NA’s :5192 | NA’s :5126 | NA’s :5331 | NA’s :5149 | NA’s :5176 | NA’s :5244 | NA’s :5823 | NA’s :5183 | NA’s :5540 | NA’s :5148 | NA’s :5827 | NA’s :5476 | NA’s :5175 | NA’s :5171 | NA’s :5172 | NA’s :5172 | NA’s :5127 | NA’s :5631 | NA’s :5172 | NA’s :5244 | NA’s :5834 | NA’s :5776 | NA’s :5149 | NA’s :5244 | NA’s :5176 | NA’s :5176 | NA’s :5447 | NA’s :5149 | NA’s :5605 | NA’s :5149 | NA’s :5538 | NA’s :5369 | NA’s :5828 | NA’s :5131 | NA’s :5244 | NA’s :5723 | NA’s :5175 | NA’s :5170 | NA’s :5170 |
The author of the original dataset decided to label columns with full-length names of the blood tests, which are too long and incomprehensible for our purposes. Also in some of names occur special symbols, which were degenerated during data loading. To obtain easy to work with column names the following steps were performed:
df <- df %>%
rename("hs.cTnI" = "Hypersensitive.cardiac.troponinI") %>%
rename("PT" = "Prothrombin.time") %>%
rename("IL.2.receptor" = "Interleukin.2.receptor") %>%
rename("ALP" = "Alkaline.phosphatase") %>%
rename("IL.10" = "Interleukin.10") %>%
rename("AT" = "antithrombin") %>%
rename("IL.8" = "Interleukin.8") %>%
rename("RDW" = "Red.blood.cell.distribution.width.") %>%
rename("q.anti.TP" = "Quantification.of.Treponema.pallidum.antibodies") %>%
rename("MCV" = "mean.corpuscular.volume") %>%
rename("WBC.count" = "White.blood.cell.count") %>%
rename("TNF.alpha" = "Tumor.necrosis.factorα") %>%
rename("MCHC" = "mean.corpuscular.hemoglobin.concentration") %>%
rename("IL.1.beta" = "Interleukin.1β") %>%
rename("RBC.count" = "Red.blood.cell.count") %>%
rename("MPV" = "Mean.platelet.volume") %>%
rename("RBC.DW.SD" = "RBC.distribution.width.SD") %>%
rename("TT" = "Thrombin.time") %>%
rename("q.anti.HCV" = "HCV.antibody.quantification") %>%
rename("D.Dimer" = "D.D.dimer") %>%
rename("AST" = "aspartate.aminotransferase") %>%
rename("NT.proBNP" = "Amino.terminal.brain.natriuretic.peptide.precursor.NT.proBNP.") %>%
rename("LDH" = "Lactate.dehydrogenase") %>%
rename("P.LCR" = "platelet.large.cell.ratio.") %>%
rename("IL.6" = "Interleukin.6") %>%
rename("FDP" = "Fibrin.degradation.products") %>%
rename("PDW" = "PLT.distribution.width") %>%
rename("GGT" = "γ.glutamyl.transpeptidase") %>%
rename("INR" = "International.standard.ratio") %>%
rename("X2019.nCoV.NAT" = "X2019.nCoV.nucleic.acid.detection") %>%
rename("MCH" = "mean.corpuscular.hemoglobin.") %>%
rename("aPTT" = "Activation.of.partial.thromboplastin.time") %>%
rename("hs.CRP" = "High.sensitivity.C.reactive.protein") %>%
rename("q.anti.HIV" = "HIV.antibody.quantification") %>%
rename("ALAT" = "glutamic.pyruvic.transaminase") %>%
rename_with(~gsub('_', '.', .)) %>%
rename_with(~gsub('\\.{3}$', '.percent', .)) %>%
rename_with(~gsub('^X\\.{3}|\\.$', '', .)) %>%
rename_with(tolower, matches("^[[:upper:]]{1}[[:lower:]]+|^[[:upper:]]{2,}\\.{1}[[:upper:]]{2,}$", ignore.case = FALSE))
Obtained names are presented below.
## [1] "patient.id" "re.date" "age"
## [4] "gender" "admission.time" "discharge.time"
## [7] "outcome" "hs.cTnI" "hemoglobin"
## [10] "serum.chloride" "PT" "procalcitonin"
## [13] "eosinophils.percent" "IL.2.receptor" "ALP"
## [16] "albumin" "basophil.percent" "IL.10"
## [19] "total.bilirubin" "platelet.count" "monocytes.percent"
## [22] "AT" "IL.8" "indirect.bilirubin"
## [25] "RDW" "neutrophils.percent" "total.protein"
## [28] "q.anti.TP" "prothrombin.activity" "HBsAg"
## [31] "MCV" "hematocrit" "WBC.count"
## [34] "TNF.alpha" "MCHC" "fibrinogen"
## [37] "IL.1.beta" "urea" "lymphocyte.count"
## [40] "PH.value" "RBC.count" "eosinophil.count"
## [43] "corrected.calcium" "serum.potassium" "glucose"
## [46] "neutrophils.count" "direct.bilirubin" "MPV"
## [49] "ferritin" "RBC.DW.SD" "TT"
## [52] "lymphocyte" "q.anti.HCV" "D.Dimer"
## [55] "total.cholesterol" "AST" "uric.acid"
## [58] "HCO3" "calcium" "NT.proBNP"
## [61] "LDH" "P.LCR" "IL.6"
## [64] "FDP" "monocytes.count" "PDW"
## [67] "globulin" "GGT" "INR"
## [70] "basophil.count.percent" "X2019.nCoV.NAT" "MCH"
## [73] "aPTT" "hs.CRP" "q.anti.HIV"
## [76] "serum.sodium" "thrombocytocrit" "ESR"
## [79] "ALAT" "eGFR" "creatinine"
All of these attributes are of numeric type. In the summary we could observe that in one column some values are equal to -1. According to the article, this number was selected as a label for missing values. After checking the whole content of that column, we can see that there are only two different values {NA, -1}, which indicates that it’s filled only with missing values, thus should be removed from the dataset.
## [1] "X2019.nCoV.NAT"
## [1] NA -1
df <- df %>% select(-c(colnames(df)[apply(df, 2, function(x) any(x == -1, na.rm = TRUE))]))
The number of lines per patient is clearly inconstant, which suggests that every patient could have undergone various combinations of specific blood tests. Also, it’s common knowledge that one sample of drawn blood is enough to perform multiple tests. We would like to check if some of the blood tests available in our dataset were more likely to be performed together.
Firstly, let’s check how many unique blood tests combinations are present in the dataset.
trans <- unique(apply(!is.na(df[, 8:80]), 1, which))
length(trans)
## [1] 175
Now we will try to find closed frequent itemsets with apriori algorithm. The maximum number of items per itemset is set to 25. The maximum subset checking time is 60 seconds, support = 0.09 and confidence = 0.8.
freq_itemsets <- apriori(trans, parameter = list(support=0.09, maxlen=25, maxtime=60, target="closed frequent itemsets"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 60 0.09 1
## maxlen target ext
## 25 closed frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 15
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[73 item(s), 175 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 done [28.85s].
## filtering closed item sets ... done [16.54s].
## sorting transactions ... done [0.00s].
## writing ... [162 set(s)] done [0.27s].
## creating S4 object ... done [0.37s].
max_freq_itemsets <- freq_itemsets[is.maximal(freq_itemsets)]
itemLabels(max_freq_itemsets) <- colnames(df[, 8:80])
inspect(sort(max_freq_itemsets, by='support'))
## items support transIdenticalToItemsets count
## [1] {hs.cTnI} 0.18285714 0.000000000 32
## [2] {procalcitonin,
## NT.proBNP} 0.14857143 0.000000000 26
## [3] {q.anti.TP,
## HBsAg,
## q.anti.HCV,
## q.anti.HIV} 0.11428571 0.000000000 20
## [4] {serum.chloride,
## ALP,
## albumin,
## total.bilirubin,
## indirect.bilirubin,
## total.protein,
## urea,
## corrected.calcium,
## serum.potassium,
## glucose,
## direct.bilirubin,
## total.cholesterol,
## AST,
## uric.acid,
## HCO3,
## calcium,
## LDH,
## globulin,
## GGT,
## hs.CRP,
## serum.sodium,
## ALAT,
## eGFR,
## creatinine} 0.11428571 0.040000000 20
## [5] {PT,
## AT,
## prothrombin.activity,
## fibrinogen,
## TT,
## D.Dimer,
## FDP,
## INR,
## aPTT} 0.10857143 0.000000000 19
## [6] {ferritin,
## hs.CRP} 0.09714286 0.005714286 17
## [7] {hemoglobin,
## eosinophils.percent,
## basophil.percent,
## platelet.count,
## monocytes.percent,
## neutrophils.percent,
## MCV,
## hematocrit,
## WBC.count,
## MCHC,
## lymphocyte.count,
## RBC.count,
## eosinophil.count,
## neutrophils.count,
## lymphocyte,
## monocytes.count,
## basophil.count.percent,
## MCH} 0.09714286 0.000000000 17
## [8] {procalcitonin,
## glucose} 0.09142857 0.022857143 16
## [9] {serum.chloride,
## PT,
## prothrombin.activity,
## serum.potassium,
## calcium,
## INR,
## serum.sodium} 0.09142857 0.000000000 16
## [10] {ALP,
## albumin,
## total.bilirubin,
## indirect.bilirubin,
## total.protein,
## urea,
## direct.bilirubin,
## ferritin,
## total.cholesterol,
## AST,
## uric.acid,
## HCO3,
## LDH,
## globulin,
## GGT,
## ALAT,
## eGFR,
## creatinine} 0.09142857 0.000000000 16
## [11] {serum.chloride,
## ALP,
## albumin,
## total.bilirubin,
## total.protein,
## urea,
## corrected.calcium,
## serum.potassium,
## direct.bilirubin,
## ferritin,
## total.cholesterol,
## AST,
## uric.acid,
## HCO3,
## calcium,
## LDH,
## globulin,
## GGT,
## serum.sodium,
## ALAT,
## eGFR,
## creatinine} 0.09142857 0.000000000 16
## [12] {serum.chloride,
## procalcitonin,
## ALP,
## albumin,
## total.bilirubin,
## total.protein,
## urea,
## corrected.calcium,
## serum.potassium,
## direct.bilirubin,
## total.cholesterol,
## AST,
## uric.acid,
## HCO3,
## calcium,
## LDH,
## globulin,
## GGT,
## hs.CRP,
## serum.sodium,
## ALAT,
## eGFR,
## creatinine} 0.09142857 0.000000000 16
The algorithm returned 12 closed frequent itemsets. Analyzing the statistics above, we can draw some conclusions:
For now the percent of NA values in the dataset is too high.
(sum(is.na(df)) / (nrow(df) * ncol(df))) * 100
## [1] 79.9435
To minimize the number of missing values, we will collapse all rows linked to one patient with tests’ results, which were released on the same day. In case of multiple results of the same test during one day, the latest will be saved.
df_clean <- df %>%
mutate(re.date = floor_date(re.date, unit = "day")) %>%
group_by(patient.id, re.date) %>%
summarize(across(age:outcome, last), across(hs.cTnI:creatinine, ~.[last(which(!is.na(.)))])) %>%
ungroup()
Let’s remove patients who have not been tested for at least 50% of all available tests.
ids <- df_clean %>%
group_by(patient.id) %>%
summarize(across(hs.cTnI:creatinine, ~any(!is.na(.)))) %>%
summarize(patient.id = patient.id, n = rowSums(select(., c(hs.cTnI:creatinine)))) %>%
ungroup() %>%
filter(n > ncol(select(df_clean, c(hs.cTnI:creatinine))) * 0.5)
df_clean <- df_clean %>%
filter(patient.id %in% ids$patient.id)
(sum(is.na(df_clean)) / (nrow(df_clean) * ncol(df_clean))) * 100
## [1] 51.36116
As we can see, the percentage of NA values decreased from approximately 80% to 51%, which is still high, but deleting even more records would significantly reduce the size of the dataset.
Final version of the dataset contains 1698 rows and 80 columns. Each row represents a set of blood tests results, which were performed for a single patient on a given day.
dim(df_clean)
## [1] 1698 80
Detailed description of individual columns of the dataset with some basic statistics are introduced in the codebook below.
A sample of clean data is presented below.
This section is dedicated to attribute values analysis and visualization. It covers not only separate study of each variable but also attempts of finding correlation between them.
The following chart presents how many blood tests of a given type were performed during each day from January 10 to February 18, 2020.
At first glance, we observe a significant rise in the total number of tests after January 27, which suggests that from this day forward more people infected with COVID-19 were admitted to the hospital on a daily basis. Another fact is that before that day the variety of blood test types, which were performed, was visibly lower. Throughout the whole considered period, the most frequently carried out exams were tests included in Complete Blood Count, Blood Biochemistry and glucose level categories.To confirm the suspicion of a growing number of admitted patients starting on January 27, let’s construct a chart visualizing the number of admitted and discharged patients on a given day.
The number of patients admitted to the hospital on a single day peaked on February 10, whereas the highest number of discharged patients was observed on February 18.This section is broke into 3 parts: analyzing data concerning patient’s personal information, blood test results for each type and correlation between them.
The following overlaying histograms portray age distribution of COVID-19 survivors and victims.
The majority of patients, who died for discussed disease, were people aged 65 years or older. Among younger people there were only 2 lethal victims. Beside those 2 outlier cases, people below 40 years were reported to be cured.
The next figure shows gender and outcome of treatment distributions.
In a set of 357 patients, 210 of them were male and 147 were female. The ratio of the deceased patients to the ones, who were cured, equals to approximately 0.84, which even at the time was alarmingly high.
Let’s calculate the total amount of time spent in the hospital by each patient and group these values by the outcome of treatment.
The figure shows that the mean number of days of medical follow-up for the deceased patients was less than the corresponding value for the survivors. The conclusion is that people who were reported dead were admitted to the hospital mostly in already severe condition, whereas patients who got the treatment in early onset of disease had better chances of getting cured.Based on the assumption that potential readers of this analysis are not proficient in interpreting blood test results, calculating basic statistics for each column won’t be very helpful in understanding what they mean. To avoid this, the data in each column was grouped by the outcome of treatment. Comparing results of these two classes gives us a useful point of view in deciding, which value ranges can be considered abnormal for a specific variable.
At first glance, we notice graphs with most visible contrasts between ‘survived’ and ‘deceased’ classes, which are for lymphocytes, D-Dimer, LDH, FDP and hs-CRP. These items are present in 3 different maximal frequent itemsets, which were detected earlier and classified as Blood Biochemistry, Total Blood Count and coagulation tests. Trying to analyze the plots by these categories, we observe some interesting trends for patients who died.
Blood Biochemistry:
These insights suggest possible damage to vital organs and malfunction of the excretory system caused by COVID-19.
Total Blood Count:
Coagulation tests:
Abnormal coagulation results, in particular for patients who had markedly elevated D-D and FDP, seem to go hand in hand with a higher risk of mortality associated with COVID-19.
The abnormally high result values for hs-cTnI and NT-proBNP were other observed indicators for higher disease severity. This article confirms that statement for serum high-sensitivity cardiac Troponin I, adding that it is true only for patients with no preexisting cardiovascular disease.
Similarly, extremely high levels of procalcitonin are positively associated with the severity of COVID-19 according to this publication. In this case, it can help in distinguishing between severe/critical patients and moderate patients with the disease.
In order to check if some of variables concerning blood test results values are correlated, we calculate Pearson correlation coefficient for columns [8:80], obtaining correlation matrix. Beside the value of the correlation coefficient we will take under consideration significance level as well. To do so, we calculate p-values.
The plot below presents a correlogram combined with the significance test sorted by the coefficient value. Correlations with p-value > 5% are considered as insignificant, thus the correlation coefficient values are left blank in that case. The diagonal is omitted for better readability. The correlogram reveals a couple of strongly correlated attribute clusters marked with dark blue color expressing value of correlation close to 1. Most of them are groups of tests performed alternatively or on a similar purpose.
D-Dimer and FDP, which were stated as important indicators in the previous section, are strongly correlated. The same goes for the pair of PT and INR and a bigger cluster consisting of 4 elements: LDH, hs.cTnI, NT-proBNT and ferritin.
This section describes the consecutive phases of building a classifier, which predicts whether a patient will be cured or die of COVID-19 based on his or her blood test results. The subsections are focusing on dataset preprocessing, the classifier building and analysis of attribute importance respectively.
Firstly, the dataset has to be transformed to a form where observations represent patients. To do so, we group clean dataset by patient.id and, in case of many records in one column, choose the most recent result. Patient ID, admission and discharge timestamps, as well as date of results release are no longer needed, thus shall be dropped.
df_ppl <- df_clean %>%
group_by(patient.id) %>%
fill(everything()) %>%
summarise(across(everything(), last)) %>%
ungroup() %>%
select(-c(patient.id, re.date, admission.time, discharge.time))
(sum(is.na(df_ppl)) / (nrow(df_ppl) * ncol(df_ppl))) * 100
## [1] 8.79773
The obtained dataset consists in around 9% of missing values. In attribute analysis we determined potentially useful in survival prediction columns. Let’s check how many unique numbers of NAs they contain.
## [1] 1 2 3 4 5 15 45 59 70 91 155
For columns with less than 16 NAs, the patients with those missing values will be deleted. After that, the remaining columns with any NA values will be dropped.
na_n <- sapply(df_ppl[, cols_all], function(c) sum(is.na(c), TRUE))
cols <- names(na_n)[na_n < 16]
df_ppl <- df_ppl %>%
filter(across(cols, ~!is.na(.))) %>%
select_if(~ !any(is.na(.)))
dim(df_ppl)
## [1] 343 46
colnames(df_ppl)
## [1] "age" "gender" "outcome"
## [4] "hemoglobin" "serum.chloride" "PT"
## [7] "eosinophils.percent" "ALP" "albumin"
## [10] "basophil.percent" "total.bilirubin" "platelet.count"
## [13] "monocytes.percent" "indirect.bilirubin" "neutrophils.percent"
## [16] "total.protein" "prothrombin.activity" "MCV"
## [19] "hematocrit" "WBC.count" "MCHC"
## [22] "urea" "lymphocyte.count" "RBC.count"
## [25] "eosinophil.count" "serum.potassium" "neutrophils.count"
## [28] "direct.bilirubin" "lymphocyte" "total.cholesterol"
## [31] "AST" "uric.acid" "HCO3"
## [34] "calcium" "LDH" "monocytes.count"
## [37] "globulin" "GGT" "INR"
## [40] "basophil.count.percent" "MCH" "hs.CRP"
## [43] "serum.sodium" "ALAT" "eGFR"
## [46] "creatinine"
In order to build a classifier, we divide previously prepared dataset into training (75%) and testing (25%) sets. The method used for training is repeated k-cross validation with 2 folds and 5 repeats. The model of choice is Random Forest.
inTraining <- createDataPartition(y = df_ppl$outcome, p = .75, list = FALSE)
training <- df_ppl[inTraining,]
testing <- df_ppl[-inTraining,]
rfGrid <- expand.grid(mtry = 10:30)
gridCtrl <- trainControl(
method = "repeatedcv",
summaryFunction = twoClassSummary,
classProbs = TRUE,
number = 2,
repeats = 5)
fitTune <- train(outcome ~ .,
data = training,
method = "rf",
metric = "ROC",
preProc = c("center", "scale"),
trControl = gridCtrl,
tuneGrid = rfGrid,
ntree = 30)
fitTune
## Random Forest
##
## 258 samples
## 45 predictor
## 2 classes: 'survived', 'deceased'
##
## Pre-processing: centered (45), scaled (45)
## Resampling: Cross-Validated (2 fold, repeated 5 times)
## Summary of sample sizes: 129, 129, 129, 129, 129, 129, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 10 0.9927119 0.9600000 0.9677966
## 11 0.9920702 0.9671429 0.9677966
## 12 0.9904358 0.9628571 0.9745763
## 13 0.9926998 0.9685714 0.9711864
## 14 0.9931840 0.9614286 0.9627119
## 15 0.9918886 0.9642857 0.9661017
## 16 0.9927119 0.9614286 0.9661017
## 17 0.9918039 0.9657143 0.9694915
## 18 0.9922518 0.9642857 0.9644068
## 19 0.9914407 0.9642857 0.9474576
## 20 0.9921186 0.9614286 0.9593220
## 21 0.9911138 0.9642857 0.9610169
## 22 0.9915133 0.9628571 0.9491525
## 23 0.9901211 0.9585714 0.9525424
## 24 0.9904116 0.9585714 0.9508475
## 25 0.9904843 0.9600000 0.9508475
## 26 0.9917070 0.9685714 0.9508475
## 27 0.9892615 0.9600000 0.9525424
## 28 0.9906659 0.9657143 0.9576271
## 29 0.9907022 0.9614286 0.9525424
## 30 0.9890557 0.9528571 0.9457627
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.
ggplot(fitTune)
rfTuneOutcomes <- predict(fitTune, newdata = testing)
confusionMatrix(data = rfTuneOutcomes, testing$outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction survived deceased
## survived 46 1
## deceased 0 38
##
## Accuracy : 0.9882
## 95% CI : (0.9362, 0.9997)
## No Information Rate : 0.5412
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9763
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9744
## Pos Pred Value : 0.9787
## Neg Pred Value : 1.0000
## Prevalence : 0.5412
## Detection Rate : 0.5412
## Detection Prevalence : 0.5529
## Balanced Accuracy : 0.9872
##
## 'Positive' Class : survived
##
The accuracy of the obtained classifier equals approximately 98%.
The plot below portrays importance of attributes used in model training. The first 16 variables are included in the set of columns, which we discussed as crucial in the attribute analysis section. What’s surprising, the female gender is quite high in the ranking.
In general, we observe a higher fatality rate of COVID-19 disease in the elderly group and people with preexisted underlying health conditions. People who have been provided with medical care from very onset phases of the disease were more likely to survive. Blood biochemistry and CBC tests proved to be useful in detecting first signs of infection. Based on source data, the most accurate biomarkers in detecting patients with COVID-19 are extremely high levels of hs-cTnI, NT-proBNP and hs-CRP. The high importance of hs-CRP was confirmed by the classifier. The trained model outputted LDH and quantity of various white blood cells as the most crucial biomarkers in detecting presence of coronavirus-19 in the organism.